In your final repo, there should be an R markdown file that organizes all computational steps for evaluating your proposed Facial Expression Recognition framework.

This file is currently a template for running evaluation experiments. You should update it according to your codes but following precisely th e same structure.

if(!require("EBImage")){
  install.packages("BiocManager")
  BiocManager::install("EBImage")
}
if(!require("R.matlab")){
  install.packages("R.matlab")
}
if(!require("readxl")){
  install.packages("readxl")
}

if(!require("dplyr")){
  install.packages("dplyr")
}
if(!require("readxl")){
  install.packages("readxl")
}

if(!require("ggplot2")){
  install.packages("ggplot2")
}

if(!require("caret")){
  install.packages("caret")
}

if(!require("glmnet")){
  install.packages("glmnet")
}

if(!require("WeightedROC")){
  install.packages("WeightedROC")
}

library(R.matlab)
library(readxl)
library(dplyr)
library(EBImage)
library(ggplot2)
library(caret)
library(glmnet)
library(WeightedROC)

Step 0 set work directories

set.seed(2020)
setwd("./")
# here replace it with your own path or manually set it in RStudio to where this rmd file is located. 
# use relative path for reproducibility

Provide directories for training images. Training images and Training fiducial points will be in different subfolders.

train_dir <- "../../train_set/" # This will be modified for different data sets.
train_image_dir <- paste(train_dir, "images/", sep="")
train_pt_dir <- paste(train_dir,  "points/", sep="")
train_label_path <- paste(train_dir, "label.csv", sep="") 

Step 1: set up controls for evaluation experiments.

In this chunk, we have a set of controls for the evaluation experiments.

run.cv <- TRUE # run cross-validation on the training set
sample.reweight <- FALSE # run sample reweighting in model training
K <- 5  # number of CV folds
run.feature.train <- TRUE # process features for training set
run.test <- TRUE # run evaluation on an independent test set
run.feature.test <- TRUE # process features for test set

Using cross-validation or independent test set evaluation, we compare the performance of models with different specifications. In this Starter Code, we tune parameter lambda (the amount of shrinkage) for logistic regression with LASSO penalty.

lmbd = c(1e-3, 5e-3, 1e-2, 5e-2, 1e-1)
model_labels = paste("LASSO Penalty with lambda =", lmbd)

Step 2: import data and train-test split

#train-test split
info <- read.csv(train_label_path)
n <- nrow(info)
n_train <- round(n*(4/5), 0)
train_idx <- sample(info$Index, n_train, replace = F)
test_idx <- setdiff(info$Index, train_idx) #get the index in info$Index but different from train_idx

If you choose to extract features from images, such as using Gabor filter, R memory will exhaust all images are read together. The solution is to repeat reading a smaller batch(e.g 100) and process them.

n_files <- length(list.files(train_image_dir))

image_list <- list()
for(i in 1:100){
   image_list[[i]] <- readImage(paste0(train_image_dir, sprintf("%04d", i), ".jpg"))
}

Fiducial points are stored in matlab format. In this step, we read them and store them in a list.

#function to read fiducial points
#input: index
#output: matrix of fiducial points corresponding to the index
readMat.matrix <- function(index){
     return(round(readMat(paste0(train_pt_dir, sprintf("%04d", index), ".mat"))[[1]],0))
}

#load fiducial points
fiducial_pt_list <- lapply(1:n_files, readMat.matrix)
save(fiducial_pt_list, file="../output/fiducial_pt_list.RData")

Step 3: construct features and responses

Figure1

feature.R should be the wrapper for all your feature engineering functions and options. The function feature( ) should have options that correspond to different scenarios for your project and produces an R object that contains features and responses that are required by all the models you are going to evaluate later.

source("../lib/feature.R")
tm_feature_train <- NA
if(run.feature.train){
  tm_feature_train <- system.time(dat_train <- feature(fiducial_pt_list, train_idx))
  save(dat_train, file="../output/feature_train.RData")
}else{
  load(file="../output/feature_train.RData")
}

tm_feature_test <- NA
if(run.feature.test){
  tm_feature_test <- system.time(dat_test <- feature(fiducial_pt_list, test_idx))
  save(dat_test, file="../output/feature_test.RData")
}else{
  load(file="../output/feature_test.RData")
}

Step 4: Train a classification model with training features and responses

Call the train model and test model from library.

train.R and test.R should be wrappers for all your model training steps and your classification/prediction steps.

source("../lib/train.R") 
source("../lib/test.R")

Model selection with cross-validation

  • Do model selection by choosing among different values of training model parameters.
source("../lib/cross_validation.R")
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label)
feature_train_new <- data.frame(read.csv("../"))
label_train_new <- data.frame(read.csv("../"))
feature_train_new <- as.matrix(feature_train_new[, -1])
label_train_new <- as.matrix(label_train_new[, -1])
if(run.cv){
  res_cv <- matrix(0, nrow = length(lmbd), ncol = 4)
  for(i in 1:length(lmbd)){
    cat("lambda = ", lmbd[i], "\n")
    res_cv[i,] <- cv.function(features = feature_train, labels = label_train, K, 
                              l = lmbd[i], reweight = sample.reweight)
  save(res_cv, file="../output/res_cv.RData")
  }
}else{
  load("../output/res_cv.RData")
}

Visualize cross-validation results.

  
res_cv <- as.data.frame(res_cv) 
colnames(res_cv) <- c("mean_error", "sd_error", "mean_AUC", "sd_AUC")
res_cv$k = as.factor(lmbd)

if(run.cv){
  p1 <- res_cv %>% 
    ggplot(aes(x = as.factor(lmbd), y = mean_error,
               ymin = mean_error - sd_error, ymax = mean_error + sd_error)) + 
    geom_crossbar() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  p2 <- res_cv %>% 
    ggplot(aes(x = as.factor(lmbd), y = mean_AUC,
               ymin = mean_AUC - sd_AUC, ymax = mean_AUC + sd_AUC)) + 
    geom_crossbar() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  print(p1)
  print(p2)
}
  • Choose the “best” parameter value
par_best <- lmbd[which.min(res_cv$mean_error)] # lmbd[which.max(res_cv$mean_AUC)]
  • Train the model with the entire training set using the selected model (model parameter) via cross-validation.
# training weights
weight_train <- rep(NA, length(label_train))
for (v in unique(label_train)){
  weight_train[label_train == v] = 0.5 * length(label_train) / length(label_train[label_train == v])
}
if (sample.reweight){
  tm_train <- system.time(fit_train <- train(feature_train, label_train, w = weight_train, par_best))
} else {
  tm_train <- system.time(fit_train <- train(feature_train, label_train, w = NULL, par_best))
}
save(fit_train, file="../output/fit_train.RData")

Step 5: Run test on test images

tm_test = NA
feature_test <- as.matrix(dat_test[, -6007])
if(run.test){
  load(file="../output/fit_train.RData")
  tm_test <- system.time({label_pred <- as.integer(test(fit_train, feature_test, pred.type = 'class')); 
                          prob_pred <- test(fit_train, feature_test, pred.type = 'response')})
}
## reweight the test data to represent a balanced label distribution
label_test <- as.integer(dat_test$label)
weight_test <- rep(NA, length(label_test))
for (v in unique(label_test)){
  weight_test[label_test == v] = 0.5 * length(label_test) / length(label_test[label_test == v])
}

accu <- sum(weight_test * (label_pred == label_test)) / sum(weight_test)
tpr.fpr <- WeightedROC(prob_pred, label_test, weight_test)
auc <- WeightedAUC(tpr.fpr)


cat("The accuracy of model:", model_labels[which.min(res_cv$mean_error)], "is", accu*100, "%.\n")
cat("The AUC of model:", model_labels[which.min(res_cv$mean_error)], "is", auc, ".\n")

Summarize Running Time

Prediction performance matters, so does the running times for constructing features and for training the model, especially when the computation resource is limited.

cat("Time for constructing training features=", tm_feature_train[1], "s \n")
cat("Time for constructing testing features=", tm_feature_test[1], "s \n")
cat("Time for training model=", tm_train[1], "s \n") 
cat("Time for testing model=", tm_test[1], "s \n")

5.Advanced model

5.1 Random Forest

Random forest is an ensemble machine learning model. The Random Forest model is a collection of several decision trees. These trees come together to a combined decision to give the output. The Random Forest model should have trees which are uncorrelated.

When data points are extremely complicated in arrangement, a simple classification model like Logistic Regression may underfit the model and make wrong predictions. Further, the time taken for the system to process it would also be considerably high. Therefore, we are losing out on both ends. In such scenario, Random Forest can realize that and have more accurate predictions. Further, overfitting is avoided and the model can also handle missing values.

5.1.1 cross-validation

After the cross validation process, we don’t see any remarkable difference between groups. It may due to the feature of the random forest which have trees are uncorrelated. So basically, we use the data after cross-validation to avoid any potential issues later.

library(AUC)
run.cv.rf<-FALSE
source("../lib/cross_validarion_rf.R")
source("../lib/rand_f.R")
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 

ntrees<-c(100,200,300,400,500)
mtrys<-c(5,10)
if(run.cv.rf){
  res_cv_rf <- matrix(0, nrow = length(ntrees)*length(mtrys), ncol = 2)
  for(i in 1:length(mtrys)){
    cat("ntree = ", ntrees[i], "\n")
    for(j in 1:length(ntrees)){
    cat("mtry = ", mtrys[j], "\n")
    res_cv_rf[i,] <- cv.function_rf(
      features = feature_train, labels = label_train, K, ntree = ntrees[j],mtry=mtrys[i])
  }}
  save(res_cv_rf, file="../output/res_cv_RF.RData")
  
}else{
  #load("../output/res_cv_RF.RData")
}
5.1.2 Choosing Random Forest tunning parameter

By running the model with different parameter with same training data. We get several different number of accuracy and AUC. Since we do not need to consider overfitting for the random forest model and in the seek of time for the training model and test model, we decided to choose the parameter with the highest accuracy and AUC with the minimum number of trees and mtrys.

So we choose the parameter, ntrees = 100, mtry =20.

source("../lib/rand_f.R")
# training rf
# ntree=1000,  nodesize=25, samplesize=nrow(trainx)
run.train.RF<-FALSE
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 
ntrees<-c(25,100,300,500,800)
mtrys<-c(20)
run.find.RF<-TRUE
if (run.find.RF) {
     res_pa_rf <- matrix(NA,nrow=5,ncol=2)
     for(i in 1:length(ntrees)){
         randomforest_fit <- train_RF(feature_train, label_train, ntree = ntrees[i], mtry = mtrys)
         label_pred <- as.integer(test_RF(randomforest_fit, feature_test))
         label_pred <- ifelse(label_pred == 2, 0, 1)
         accu_rf = sum(label_pred==dat_test$label)/length(dat_test$label)
         res_pa_rf[i,1]<-accu_rf
         tpr.fpr <- WeightedROC(label_pred, dat_test$label)
         auc_rf <- WeightedAUC(tpr.fpr)
         res_pa_rf[i,2]<-auc_rf
         }
      save(res_pa_rf, file="../output/res_pa_RF.RData")
}else{
   load("../output/res_pa_RF.RData")
}
     

visualization of the parameter to choose

load("../output/res_pa_RF.RData")
plot(ntrees, res_pa_rf[,1], type = "l",ylim=c(0.76,0.795))
points(ntrees,res_pa_rf[,2]+0.25)

5.1.3 Train the model with tunning parameter on original dataset

Train the model, save the data to the output rDATA with the parameter find before.

source("../lib/rand_f.R")
# training rf

run.train.RF<-TRUE
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 

if (run.train.RF) {
  tm_rfTrain <- system.time(
     randomforest_fit <- train_RF(feature_train, label_train, ntree = 100, mtry = 20)
)
save(randomforest_fit, file = "../output/train_randomforest.RData")
}
5.1.4 Test the model by train data on original dataset
# testing rf
tm_rf_train_test = NA

run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_rf_train_test <- system.time(
    label_pred_train <- as.integer(test_RF(randomforest_fit, feature_train))); 
}

label_pred_train <- ifelse(label_pred_train == 2, 1, 2)
accu_rf_train_test <- sum(label_pred_train==label_train)/length(label_train)

tpr.fpr_train <- WeightedROC(label_pred_train, label_train)
auc_rf_train <- WeightedAUC(tpr.fpr_train)
5.1.5 Test the model by test data on original dataset

Calculate testing time and accuracy on testing data

# testing rf
tm_test_RF = NA
feature_test <- as.matrix(dat_test[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF <- system.time(
    label_pred <- as.integer(test_RF(randomforest_fit, feature_test))); 
}
5.1.6 Train the model with tunning parameter on resampled dataset

Train the model, save the data to the output rDATA with the parameter find before.

source("../lib/rand_f.R")
# training rf
tm_rfTrain_resample<-NA
run.train.RF<-TRUE

if (run.train.RF) {
  tm_rfTrain_resample <- system.time(
     randomforest_fit_resample <- train_RF(feature_train_new, label_train_new, ntree = 100, mtry = 20)
)
save(randomforest_fit_resample, file = "../output/train_randomforest_resample.RData")
}
5.1.7 Test the model by train data on resampled dataset
# testing rf
tm_rf_train_test_resample = NA

run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest_resample.RData")
  tm_rf_train_test_resample <- system.time(
    label_pred_train_resample <- as.integer(test_RF(randomforest_fit_resample, feature_train_new))); 
}

label_pred_train_resample <- ifelse(label_pred_train_resample == 2, 1, 2)
accu_rf_train_test_resample <- sum(label_pred_train_resample==label_train_new)/length(label_train_new)

tpr.fpr_train_resample <- WeightedROC(label_pred_train_resample, label_train_new)
auc_rf_train_resample <- WeightedAUC(tpr.fpr_train_resample)
5.1.8 Test the resampled model by test data on test dataset

Calculate testing time and accuracy on testing data

# testing rf
tm_test_RF_resample = NA
feature_test <- as.matrix(dat_test[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF_resample <- system.time(
    label_pred_resample <- as.integer(test_RF(randomforest_fit_resample, feature_test))); 
}
5.1.9 Summary of the random forest model
cat("The unweighted accuracy of the random forest model on train data is ", accu_rf_train_test*100, "%.\n")
The unweighted accuracy of the random forest model on train data is  91.58333 %.
cat("The unweighted AUC of the random forest model on train data is ", auc_rf_train, ".\n")
The unweighted AUC of the random forest model on train data is  0.7832618 .
cat("The unweighted accuracy of the random forest model on test data is ", accu_rf_test_test*100, "%.\n")
The unweighted accuracy of the random forest model on test data is  78.66667 %.
cat("The unweighted AUC of the random forest model on test data is ", auc_rf_test, ".\n")
The unweighted AUC of the random forest model on test data is  0.5205905 .
cat("The resampled accuracy of the random forest model on train data is ", accu_rf_train_test_resample*100, "%.\n")
The resampled accuracy of the random forest model on train data is  98.88803 %.
cat("The resampled AUC of the random forest model on train data is ", auc_rf_train_resample, ".\n")
The resampled AUC of the random forest model on train data is  0.9888831 .
cat("The resampled accuracy of the random forest model on test data is ", accu_rf_test_test_resample*100, "%.\n")
The resampled accuracy of the random forest model on test data is  94.66667 %.
cat("The resampled AUC of the random forest model on test data is ", auc_rf_test_resample, ".\n")
The resampled AUC of the random forest model on test data is  0.8787879 .
cat("Time for training model Random Forest = ", tm_rfTrain[1], "s \n")
Time for training model Random Forest =  118.398 s 
cat("Time for test model Random Forest on train data is ",tm_rf_train_test[1] , "s \n")
Time for test model Random Forest on train data is  0.746 s 
cat("Time for test model Random Forest on test data is ",tm_test_RF[1] , "s \n")
Time for test model Random Forest on test data is  0.176 s 

#####The test day code

# testing rf
tm_test_RF_td = NA
feature_test_td <- as.matrix(td[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF_td <- system.time(
    label_pred_td <- as.integer(test_RF(randomforest_fit_resample, feature_test_td)))
   write.csv(label_pred_td, "mydata.csv")
}

###Reference - Du, S., Tao, Y., & Martinez, A. M. (2014). Compound facial expressions of emotion. Proceedings of the National Academy of Sciences, 111(15), E1454-E1462.

---
title: "Main"
author: "Chengliang Tang, Yujie Wang, Diane Lu, Tian Zheng"
output:
  pdf_document: default
  html_notebook: default
---

In your final repo, there should be an R markdown file that organizes **all computational steps** for evaluating your proposed Facial Expression Recognition framework. 

This file is currently a template for running evaluation experiments. You should update it according to your codes but following precisely th    e same structure. 

```{r message=FALSE}
if(!require("EBImage")){
  install.packages("BiocManager")
  BiocManager::install("EBImage")
}
if(!require("R.matlab")){
  install.packages("R.matlab")
}
if(!require("readxl")){
  install.packages("readxl")
}

if(!require("dplyr")){
  install.packages("dplyr")
}
if(!require("readxl")){
  install.packages("readxl")
}

if(!require("ggplot2")){
  install.packages("ggplot2")
}

if(!require("caret")){
  install.packages("caret")
}

if(!require("glmnet")){
  install.packages("glmnet")
}

if(!require("WeightedROC")){
  install.packages("WeightedROC")
}

library(R.matlab)
library(readxl)
library(dplyr)
library(EBImage)
library(ggplot2)
library(caret)
library(glmnet)
library(WeightedROC)
```

### Step 0 set work directories
```{r wkdir, eval=FALSE}
set.seed(2020)
setwd("./")
# here replace it with your own path or manually set it in RStudio to where this rmd file is located. 
# use relative path for reproducibility
```

Provide directories for training images. Training images and Training fiducial points will be in different subfolders. 
```{r}
train_dir <- "../../train_set/" # This will be modified for different data sets.
train_image_dir <- paste(train_dir, "images/", sep="")
train_pt_dir <- paste(train_dir,  "points/", sep="")
train_label_path <- paste(train_dir, "label.csv", sep="") 
```

### Step 1: set up controls for evaluation experiments.

In this chunk, we have a set of controls for the evaluation experiments. 

+ (T/F) cross-validation on the training set
+ (T/F) reweighting the samples for training set 
+ (number) K, the number of CV folds
+ (T/F) process features for training set
+ (T/F) run evaluation on an independent test set
+ (T/F) process features for test set

```{r exp_setup}
run.cv <- TRUE # run cross-validation on the training set
sample.reweight <- FALSE # run sample reweighting in model training
K <- 5  # number of CV folds
run.feature.train <- TRUE # process features for training set
run.test <- TRUE # run evaluation on an independent test set
run.feature.test <- TRUE # process features for test set
```

Using cross-validation or independent test set evaluation, we compare the performance of models with different specifications. In this Starter Code, we tune parameter lambda (the amount of shrinkage) for logistic regression with LASSO penalty.

```{r model_setup}
lmbd = c(1e-3, 5e-3, 1e-2, 5e-2, 1e-1)
model_labels = paste("LASSO Penalty with lambda =", lmbd)
```

### Step 2: import data and train-test split 
```{r}
#train-test split
info <- read.csv(train_label_path)
n <- nrow(info)
n_train <- round(n*(4/5), 0)
train_idx <- sample(info$Index, n_train, replace = F)
test_idx <- setdiff(info$Index, train_idx) #get the index in info$Index but different from train_idx
```

If you choose to extract features from images, such as using Gabor filter, R memory will exhaust all images are read together. The solution is to repeat reading a smaller batch(e.g 100) and process them. 
```{r}
n_files <- length(list.files(train_image_dir))

image_list <- list()
for(i in 1:100){
   image_list[[i]] <- readImage(paste0(train_image_dir, sprintf("%04d", i), ".jpg"))
}
```

Fiducial points are stored in matlab format. In this step, we read them and store them in a list.
```{r read fiducial points}
#function to read fiducial points
#input: index
#output: matrix of fiducial points corresponding to the index
readMat.matrix <- function(index){
     return(round(readMat(paste0(train_pt_dir, sprintf("%04d", index), ".mat"))[[1]],0))
}

#load fiducial points
fiducial_pt_list <- lapply(1:n_files, readMat.matrix)
save(fiducial_pt_list, file="../output/fiducial_pt_list.RData")
```

### Step 3: construct features and responses

+ The follow plots show how pairwise distance between fiducial points can work as feature for facial emotion recognition.

  + In the first column, 78 fiducials points of each emotion are marked in order. 
  + In the second column distributions of vertical distance between right pupil(1) and  right brow peak(21) are shown in  histograms. For example, the distance of an angry face tends to be shorter than that of a surprised face.
  + The third column is the distributions of vertical distances between right mouth corner(50)
and the midpoint of the upper lip(52).  For example, the distance of an happy face tends to be shorter than that of a sad face.

![Figure1](../figs/feature_visualization.jpg)

`feature.R` should be the wrapper for all your feature engineering functions and options. The function `feature( )` should have options that correspond to different scenarios for your project and produces an R object that contains features and responses that are required by all the models you are going to evaluate later. 
  
  + `feature.R`
  + Input: list of images or fiducial point
  + Output: an RData file that contains extracted features and corresponding responses

```{r feature}
source("../lib/feature.R")
tm_feature_train <- NA
if(run.feature.train){
  tm_feature_train <- system.time(dat_train <- feature(fiducial_pt_list, train_idx))
  save(dat_train, file="../output/feature_train.RData")
}else{
  load(file="../output/feature_train.RData")
}

tm_feature_test <- NA
if(run.feature.test){
  tm_feature_test <- system.time(dat_test <- feature(fiducial_pt_list, test_idx))
  save(dat_test, file="../output/feature_test.RData")
}else{
  load(file="../output/feature_test.RData")
}


```

### Step 4: Train a classification model with training features and responses
Call the train model and test model from library. 

`train.R` and `test.R` should be wrappers for all your model training steps and your classification/prediction steps. 

+ `train.R`
  + Input: a data frame containing features and labels and a parameter list.
  + Output:a trained model
+ `test.R`
  + Input: the fitted classification model using training data and processed features from testing images 
  + Input: an R object that contains a trained classifier.
  + Output: training model specification

+ In this Starter Code, we use logistic regression with LASSO penalty to do classification. 

```{r loadlib}
source("../lib/train.R") 
source("../lib/test.R")
```

#### Model selection with cross-validation
* Do model selection by choosing among different values of training model parameters.

```{r runcv}
source("../lib/cross_validation.R")
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label)
feature_train_new <- data.frame(read.csv("../"))
label_train_new <- data.frame(read.csv("../"))
feature_train_new <- as.matrix(feature_train_new[, -1])
label_train_new <- as.matrix(label_train_new[, -1])

```

```{r lasso }
if(run.cv){
  res_cv <- matrix(0, nrow = length(lmbd), ncol = 4)
  for(i in 1:length(lmbd)){
    cat("lambda = ", lmbd[i], "\n")
    res_cv[i,] <- cv.function(features = feature_train, labels = label_train, K, 
                              l = lmbd[i], reweight = sample.reweight)
  save(res_cv, file="../output/res_cv.RData")
  }
}else{
  load("../output/res_cv.RData")
}
```

Visualize cross-validation results. 
```{r cv_vis}
  
res_cv <- as.data.frame(res_cv) 
colnames(res_cv) <- c("mean_error", "sd_error", "mean_AUC", "sd_AUC")
res_cv$k = as.factor(lmbd)

if(run.cv){
  p1 <- res_cv %>% 
    ggplot(aes(x = as.factor(lmbd), y = mean_error,
               ymin = mean_error - sd_error, ymax = mean_error + sd_error)) + 
    geom_crossbar() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  p2 <- res_cv %>% 
    ggplot(aes(x = as.factor(lmbd), y = mean_AUC,
               ymin = mean_AUC - sd_AUC, ymax = mean_AUC + sd_AUC)) + 
    geom_crossbar() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  print(p1)
  print(p2)
}


```


* Choose the "best" parameter value
```{r best_model}
par_best <- lmbd[which.min(res_cv$mean_error)] # lmbd[which.max(res_cv$mean_AUC)]
```

* Train the model with the entire training set using the selected model (model parameter) via cross-validation.
```{r final_train}
# training weights
weight_train <- rep(NA, length(label_train))
for (v in unique(label_train)){
  weight_train[label_train == v] = 0.5 * length(label_train) / length(label_train[label_train == v])
}
if (sample.reweight){
  tm_train <- system.time(fit_train <- train(feature_train, label_train, w = weight_train, par_best))
} else {
  tm_train <- system.time(fit_train <- train(feature_train, label_train, w = NULL, par_best))
}
save(fit_train, file="../output/fit_train.RData")
```

### Step 5: Run test on test images
```{r test}
tm_test = NA
feature_test <- as.matrix(dat_test[, -6007])
if(run.test){
  load(file="../output/fit_train.RData")
  tm_test <- system.time({label_pred <- as.integer(test(fit_train, feature_test, pred.type = 'class')); 
                          prob_pred <- test(fit_train, feature_test, pred.type = 'response')})
}
```


* evaluation
```{r}
## reweight the test data to represent a balanced label distribution
label_test <- as.integer(dat_test$label)
weight_test <- rep(NA, length(label_test))
for (v in unique(label_test)){
  weight_test[label_test == v] = 0.5 * length(label_test) / length(label_test[label_test == v])
}

accu <- sum(weight_test * (label_pred == label_test)) / sum(weight_test)
tpr.fpr <- WeightedROC(prob_pred, label_test, weight_test)
auc <- WeightedAUC(tpr.fpr)


cat("The accuracy of model:", model_labels[which.min(res_cv$mean_error)], "is", accu*100, "%.\n")
cat("The AUC of model:", model_labels[which.min(res_cv$mean_error)], "is", auc, ".\n")


```

### Summarize Running Time
Prediction performance matters, so does the running times for constructing features and for training the model, especially when the computation resource is limited. 
```{r running_time}
cat("Time for constructing training features=", tm_feature_train[1], "s \n")
cat("Time for constructing testing features=", tm_feature_test[1], "s \n")
cat("Time for training model=", tm_train[1], "s \n") 
cat("Time for testing model=", tm_test[1], "s \n")
```



#### 5.Advanced model
#### 5.1 Random Forest
Random forest is an ensemble machine learning model. The Random Forest model is a collection of several decision trees. These trees come together to a combined decision to give the output. The Random Forest model should have trees which are uncorrelated.

When data points are extremely complicated in arrangement, a simple classification model like Logistic Regression may underfit the model and make wrong predictions. Further, the time taken for the system to process it would also be considerably high. Therefore, we are losing out on both ends. In such scenario, Random Forest can realize that and have more accurate predictions. Further, overfitting is avoided and the model can also handle missing values. 

######  5.1.1 cross-validation
After the cross validation process, we don't see any remarkable difference between groups. It may due to the feature of the random forest which have trees are uncorrelated. So basically, we use the data after cross-validation to avoid any potential issues later.
```{r choosingpara}
library(AUC)
run.cv.rf<-FALSE
source("../lib/cross_validarion_rf.R")
source("../lib/rand_f.R")
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 

ntrees<-c(100,200,300,400,500)
mtrys<-c(5,10)
if(run.cv.rf){
  res_cv_rf <- matrix(0, nrow = length(ntrees)*length(mtrys), ncol = 2)
  for(i in 1:length(mtrys)){
    cat("ntree = ", ntrees[i], "\n")
    for(j in 1:length(ntrees)){
    cat("mtry = ", mtrys[j], "\n")
    res_cv_rf[i,] <- cv.function_rf(
      features = feature_train, labels = label_train, K, ntree = ntrees[j],mtry=mtrys[i])
  }}
  save(res_cv_rf, file="../output/res_cv_RF.RData")
  
}else{
  #load("../output/res_cv_RF.RData")
}



```

###### 5.1.2 Choosing Random Forest tunning parameter
By running the model with different parameter with same training data. We get several different number of accuracy and AUC. Since we do not need to consider overfitting for the random forest model and in the seek of time for the training model and test model, we decided to choose the parameter with the highest accuracy and AUC with the minimum number of trees and mtrys.

So we choose the parameter, ntrees = 100, mtry =20. 
```{r find para}
source("../lib/rand_f.R")
# training rf
# ntree=1000,  nodesize=25, samplesize=nrow(trainx)
run.train.RF<-FALSE
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 
ntrees<-c(25,100,300,500,800)
mtrys<-c(20)
run.find.RF<-TRUE
if (run.find.RF) {
     res_pa_rf <- matrix(NA,nrow=5,ncol=2)
     for(i in 1:length(ntrees)){
         randomforest_fit <- train_RF(feature_train, label_train, ntree = ntrees[i], mtry = mtrys)
         label_pred <- as.integer(test_RF(randomforest_fit, feature_test))
         label_pred <- ifelse(label_pred == 2, 0, 1)
         accu_rf = sum(label_pred==dat_test$label)/length(dat_test$label)
         res_pa_rf[i,1]<-accu_rf
         tpr.fpr <- WeightedROC(label_pred, dat_test$label)
         auc_rf <- WeightedAUC(tpr.fpr)
         res_pa_rf[i,2]<-auc_rf
         }
      save(res_pa_rf, file="../output/res_pa_RF.RData")
}else{
   load("../output/res_pa_RF.RData")
}
     


```
visualization of the parameter to choose
```{r load para}
load("../output/res_pa_RF.RData")
plot(ntrees, res_pa_rf[,1], type = "l",ylim=c(0.76,0.795))
points(ntrees,res_pa_rf[,2]+0.25)
```

###### 5.1.3 Train the model with tunning parameter on original dataset
Train the model, save the data to the output rDATA with the parameter find before. 
```{r train Random Forest }
source("../lib/rand_f.R")
# training rf

run.train.RF<-TRUE
feature_train = as.matrix(dat_train[, -6007])
label_train = as.integer(dat_train$label) 

if (run.train.RF) {
  tm_rfTrain <- system.time(
     randomforest_fit <- train_RF(feature_train, label_train, ntree = 100, mtry = 20)
)
save(randomforest_fit, file = "../output/train_randomforest.RData")
}
```


###### 5.1.4 Test the model by train data on original dataset
```{r testtrain}
# testing rf
tm_rf_train_test = NA

run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_rf_train_test <- system.time(
    label_pred_train <- as.integer(test_RF(randomforest_fit, feature_train))); 
}

label_pred_train <- ifelse(label_pred_train == 2, 1, 2)
accu_rf_train_test <- sum(label_pred_train==label_train)/length(label_train)

tpr.fpr_train <- WeightedROC(label_pred_train, label_train)
auc_rf_train <- WeightedAUC(tpr.fpr_train)
```

###### 5.1.5 Test the model by test data on original dataset
Calculate testing time and accuracy on testing data
```{r test Random Forest}
# testing rf
tm_test_RF = NA
feature_test <- as.matrix(dat_test[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF <- system.time(
    label_pred <- as.integer(test_RF(randomforest_fit, feature_test))); 
}

```


```{r evaluation_rf, echo=FALSE}
label_pred1 <- ifelse(label_pred == 2, 0, 1)
accu_rf_test_test = sum(label_pred1==dat_test$label)/length(dat_test$label)
tpr.fpr_test <- WeightedROC(label_pred1, dat_test$label)
auc_rf_test <- WeightedAUC(tpr.fpr_test)
```

###### 5.1.6 Train the model with tunning parameter on resampled dataset
Train the model, save the data to the output rDATA with the parameter find before. 
```{r train Random Forest resample}
source("../lib/rand_f.R")
# training rf
tm_rfTrain_resample<-NA
run.train.RF<-FALSE

if (run.train.RF) {
  tm_rfTrain_resample <- system.time(
     randomforest_fit_resample <- train_RF(feature_train_new, label_train_new, ntree = 100, mtry = 20)
)
save(randomforest_fit_resample, file = "../output/train_randomforest_resample.RData")
}
```


###### 5.1.7 Test the model by train data on resampled dataset
```{r testtrain resample}
# testing rf
tm_rf_train_test_resample = NA

run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest_resample.RData")
  tm_rf_train_test_resample <- system.time(
    label_pred_train_resample <- as.integer(test_RF(randomforest_fit_resample, feature_train_new))); 
}

label_pred_train_resample <- ifelse(label_pred_train_resample == 2, 1, 2)
accu_rf_train_test_resample <- sum(label_pred_train_resample==label_train_new)/length(label_train_new)

tpr.fpr_train_resample <- WeightedROC(label_pred_train_resample, label_train_new)
auc_rf_train_resample <- WeightedAUC(tpr.fpr_train_resample)
```

###### 5.1.8 Test the resampled model by test data on test dataset
Calculate testing time and accuracy on testing data
```{r test Random Forest resample}
# testing rf
tm_test_RF_resample = NA
feature_test <- as.matrix(dat_test[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF_resample <- system.time(
    label_pred_resample <- as.integer(test_RF(randomforest_fit_resample, feature_test))); 
}

```


```{r evaluation_rfresample, echo=FALSE}
label_pred1_resample <- ifelse(label_pred_resample == 2, 0, 1)
accu_rf_test_test_resample = sum(label_pred1_resample==dat_test$label)/length(dat_test$label)
tpr.fpr_test_resample <- WeightedROC(label_pred1_resample, dat_test$label)
auc_rf_test_resample <- WeightedAUC(tpr.fpr_test_resample)
```



###### 5.1.9 Summary of the random forest model
```{r summary rf}

cat("The unweighted accuracy of the random forest model on train data is ", accu_rf_train_test*100, "%.\n")
cat("The unweighted AUC of the random forest model on train data is ", auc_rf_train, ".\n")
cat("The unweighted accuracy of the random forest model on test data is ", accu_rf_test_test*100, "%.\n")
cat("The unweighted AUC of the random forest model on test data is ", auc_rf_test, ".\n")
cat("The resampled accuracy of the random forest model on train data is ", accu_rf_train_test_resample*100, "%.\n")
cat("The resampled AUC of the random forest model on train data is ", auc_rf_train_resample, ".\n")
cat("The resampled accuracy of the random forest model on test data is ", accu_rf_test_test_resample*100, "%.\n")
cat("The resampled AUC of the random forest model on test data is ", auc_rf_test_resample, ".\n")

cat("Time for training model Random Forest = ", tm_rfTrain[1], "s \n")
cat("Time for test model Random Forest on train data is ",tm_rf_train_test[1] , "s \n")
cat("Time for test model Random Forest on test data is ",tm_test_RF[1] , "s \n")
```





#####The test day code
```{r test day}
# testing rf
tm_test_RF_td = NA
feature_test_td <- as.matrix(td[, -6007])
run.test.RF<-TRUE
if(run.test.RF){
  load(file="../output/train_randomforest.RData")
  tm_test_RF_td <- system.time(
    label_pred_td <- as.integer(test_RF(randomforest_fit_resample, feature_test_td)))
   write.csv(label_pred_td, "mydata.csv")
}

```














###Reference
- Du, S., Tao, Y., & Martinez, A. M. (2014). Compound facial expressions of emotion. Proceedings of the National Academy of Sciences, 111(15), E1454-E1462.













